AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).
A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better-targeted marketing to increase the success ratio.
As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify the potential customers who have a higher probability of purchasing the loan.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
warnings.filterwarnings('ignore')
df = pd.read_csv("Loan_Modelling.csv")
df2 = df.copy()  # keep an untouched copy of the original data
df.head(10)
| | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
| 5 | 6 | 37 | 13 | 29 | 92121 | 4 | 0.4 | 2 | 155 | 0 | 0 | 0 | 1 | 0 |
| 6 | 7 | 53 | 27 | 72 | 91711 | 2 | 1.5 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 7 | 8 | 50 | 24 | 22 | 93943 | 1 | 0.3 | 3 | 0 | 0 | 0 | 0 | 0 | 1 |
| 8 | 9 | 35 | 10 | 81 | 90089 | 3 | 0.6 | 2 | 104 | 0 | 0 | 0 | 1 | 0 |
| 9 | 10 | 34 | 9 | 180 | 93023 | 1 | 8.9 | 3 | 0 | 1 | 0 | 0 | 0 | 0 |
# Let's look at the structure of the data
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   ID                  5000 non-null   int64
 1   Age                 5000 non-null   int64
 2   Experience          5000 non-null   int64
 3   Income              5000 non-null   int64
 4   ZIPCode             5000 non-null   int64
 5   Family              5000 non-null   int64
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64
 8   Mortgage            5000 non-null   int64
 9   Personal_Loan       5000 non-null   int64
 10  Securities_Account  5000 non-null   int64
 11  CD_Account          5000 non-null   int64
 12  Online              5000 non-null   int64
 13  CreditCard          5000 non-null   int64
dtypes: float64(1), int64(13)
memory usage: 547.0 KB
df.shape
(5000, 14)
# Find out if there are any missing values in the dataset
df.isna().sum()
ID                    0
Age                   0
Experience            0
Income                0
ZIPCode               0
Family                0
CCAvg                 0
Education             0
Mortgage              0
Personal_Loan         0
Securities_Account    0
CD_Account            0
Online                0
CreditCard            0
dtype: int64
There are no missing values in the data.
df=df.drop(['ID'],axis=1)
df.describe(include = 'all').T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Age | 5000.0 | 45.338400 | 11.463166 | 23.0 | 35.0 | 45.0 | 55.0 | 67.0 |
| Experience | 5000.0 | 20.104600 | 11.467954 | -3.0 | 10.0 | 20.0 | 30.0 | 43.0 |
| Income | 5000.0 | 73.774200 | 46.033729 | 8.0 | 39.0 | 64.0 | 98.0 | 224.0 |
| ZIPCode | 5000.0 | 93169.257000 | 1759.455086 | 90005.0 | 91911.0 | 93437.0 | 94608.0 | 96651.0 |
| Family | 5000.0 | 2.396400 | 1.147663 | 1.0 | 1.0 | 2.0 | 3.0 | 4.0 |
| CCAvg | 5000.0 | 1.937938 | 1.747659 | 0.0 | 0.7 | 1.5 | 2.5 | 10.0 |
| Education | 5000.0 | 1.881000 | 0.839869 | 1.0 | 1.0 | 2.0 | 3.0 | 3.0 |
| Mortgage | 5000.0 | 56.498800 | 101.713802 | 0.0 | 0.0 | 0.0 | 101.0 | 635.0 |
| Personal_Loan | 5000.0 | 0.096000 | 0.294621 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Securities_Account | 5000.0 | 0.104400 | 0.305809 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| CD_Account | 5000.0 | 0.060400 | 0.238250 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Online | 5000.0 | 0.596800 | 0.490589 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| CreditCard | 5000.0 | 0.294000 | 0.455637 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
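One thing the summary above reveals: Experience has a minimum of -3, which cannot be a valid number of years. The notebook keeps the raw values; a minimal cleanup sketch (an assumption, not what the notebook actually does) would be to clip the negatives to zero:

```python
import pandas as pd

# Toy values mirroring the observed Experience range (min of -3 in the summary above)
exp = pd.Series([-3, -1, 0, 10, 43])

# Clip negative values to zero; on the real data this would be df["Experience"].clip(lower=0)
print(exp.clip(lower=0).tolist())  # [0, 0, 0, 10, 43]
```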
df.nunique()
Age 45 Experience 47 Income 162 ZIPCode 467 Family 4 CCAvg 108 Education 3 Mortgage 347 Personal_Loan 2 Securities_Account 2 CD_Account 2 Online 2 CreditCard 2 dtype: int64
df.Family.unique()
array([4, 3, 1, 2])
df.Education.unique()
array([1, 2, 3])
df.CCAvg.unique()
array([ 1.6 , 1.5 , 1. , 2.7 , 0.4 , 0.3 , 0.6 , 8.9 , 2.4 ,
0.1 , 3.8 , 2.5 , 2. , 4.7 , 8.1 , 0.5 , 0.9 , 1.2 ,
0.7 , 3.9 , 0.2 , 2.2 , 3.3 , 1.8 , 2.9 , 1.4 , 5. ,
2.3 , 1.1 , 5.7 , 4.5 , 2.1 , 8. , 1.7 , 0. , 2.8 ,
3.5 , 4. , 2.6 , 1.3 , 5.6 , 5.2 , 3. , 4.6 , 3.6 ,
7.2 , 1.75, 7.4 , 2.67, 7.5 , 6.5 , 7.8 , 7.9 , 4.1 ,
1.9 , 4.3 , 6.8 , 5.1 , 3.1 , 0.8 , 3.7 , 6.2 , 0.75,
2.33, 4.9 , 0.67, 3.2 , 5.5 , 6.9 , 4.33, 7.3 , 4.2 ,
4.4 , 6.1 , 6.33, 6.6 , 5.3 , 3.4 , 7. , 6.3 , 8.3 ,
6. , 1.67, 8.6 , 7.6 , 6.4 , 10. , 5.9 , 5.4 , 8.8 ,
1.33, 9. , 6.7 , 4.25, 6.67, 5.8 , 4.8 , 3.25, 5.67,
8.5 , 4.75, 4.67, 3.67, 8.2 , 3.33, 5.33, 9.3 , 2.75])
df.Securities_Account.unique()
array([1, 0])
df.CD_Account.unique()
array([0, 1])
df.CreditCard.unique()
array([0, 1])
# While doing univariate analysis of numerical variables, we want to study their central tendency and dispersion.
# Let us write a function that will help us create a boxplot and histogram for any input numerical
# variable.
# This function takes the numerical column as the input and returns the boxplot
# and histogram for the variable.
# Let us see if this helps us write faster and cleaner code.
def histogram_boxplot(feature, figsize=(15,10), bins=None):
    """ Boxplot and histogram combined
    feature: 1-d feature array
    figsize: size of fig (default (15,10))
    bins: number of bins (default None / auto)
    """
    sns.set(font_scale=2)  # setting the font scale for seaborn
    f2, (ax_box2, ax_hist2) = plt.subplots(nrows=2,  # number of rows of the subplot grid = 2
                                           sharex=True,  # x-axis will be shared among all subplots
                                           gridspec_kw={"height_ratios": (.25, .75)},
                                           figsize=figsize
                                           )  # creating the 2 subplots
    sns.boxplot(x=feature, ax=ax_box2, showmeans=True, color='red')  # boxplot; a marker indicates the mean value of the column
    sns.histplot(feature, ax=ax_hist2, bins=bins if bins else 'auto')  # histogram (distplot is deprecated in recent seaborn)
    ax_hist2.axvline(np.mean(feature), color='g', linestyle='--')  # add mean to the histogram
    ax_hist2.axvline(np.median(feature), color='black', linestyle='-')  # add median to the histogram
histogram_boxplot(df["Age"])
histogram_boxplot(df["Experience"])
histogram_boxplot(df["Income"])
histogram_boxplot(df["ZIPCode"])
histogram_boxplot(df["CCAvg"])
histogram_boxplot(df["Mortgage"])
# Function to create barplots that indicate the percentage for each category.
def perc_on_bar(plot, feature):
    '''
    plot: matplotlib Axes returned by sns.countplot
    feature: categorical feature
    the function won't work if a column is passed in the hue parameter
    '''
    total = len(feature)  # length of the column
    for p in plot.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/total)  # percentage of each class of the category
        x = p.get_x() + p.get_width() / 2 - 0.05  # x-coordinate of the annotation
        y = p.get_y() + p.get_height()  # y-coordinate of the annotation (top of the bar)
        plot.annotate(percentage, (x, y), size=12)  # annotate the percentage
    plt.show()  # show the plot
plt.figure(figsize=(15,5))
ax = sns.countplot(x=df["Personal_Loan"], palette='winter')
perc_on_bar(ax, df["Personal_Loan"])
plt.figure(figsize=(15,5))
ax = sns.countplot(x=df["Education"], palette='winter')
perc_on_bar(ax, df["Education"])
plt.figure(figsize=(15,5))
ax = sns.countplot(x=df['Securities_Account'], palette='winter')
perc_on_bar(ax, df["Securities_Account"])
plt.figure(figsize=(15,5))
ax = sns.countplot(x=df['CD_Account'], palette='winter')
perc_on_bar(ax, df["CD_Account"])
plt.figure(figsize=(15,5))
ax = sns.countplot(x=df['Online'], palette='winter')
perc_on_bar(ax, df["Online"])
plt.figure(figsize=(15,5))
ax = sns.countplot(x=df['CreditCard'], palette='winter')
perc_on_bar(ax, df["CreditCard"])
tab1 = pd.crosstab(df.Age,df.Personal_Loan,margins=True)
print(tab1)
print('-'*120)
tab = pd.crosstab(df.Age,df.Personal_Loan,normalize='index')
tab.plot(kind='bar',stacked=True,figsize=(17,9))
plt.legend(loc="upper left", bbox_to_anchor=(1,1));
Personal_Loan     0    1   All
Age
23               12    0    12
24               28    0    28
25               53    0    53
26               65   13    78
27               79   12    91
28               94    9   103
29              108   15   123
30              119   17   136
31              118    7   125
32              108   12   120
33              105   15   120
34              116   18   134
35              135   16   151
36               91   16   107
37               98    8   106
38              103   12   115
39              127    6   133
40              117    8   125
41              128    8   136
42              112   14   126
43              134   15   149
44              107   14   121
45              114   13   127
46              114   13   127
47              103   10   113
48              106   12   118
49              105   10   115
50              125   13   138
51              119   10   129
52              130   15   145
53              101   11   112
54              128   15   143
55              116    9   125
56              121   14   135
57              120   12   132
58              133   10   143
59              123    9   132
60              117   10   127
61              110   12   122
62              114    9   123
63               92   16   108
64               70    8    78
65               66   14    80
66               24    0    24
67               12    0    12
All            4520  480  5000
------------------------------------------------------------------------------------------------------------------------
tab1 = pd.crosstab(df.Experience,df.Personal_Loan,margins=True)
print(tab1)
print('-'*120)
tab = pd.crosstab(df.Experience,df.Personal_Loan,normalize='index')
tab.plot(kind='bar',stacked=True,figsize=(17,9))
plt.legend(loc="upper left", bbox_to_anchor=(1,1));
Personal_Loan    0    1   All
Experience
-3               4    0     4
-2              15    0    15
-1              33    0    33
 0              59    7    66
 1              66    8    74
 2              76    9    85
 3             112   17   129
 4             104    9   113
 5             132   14   146
 6             107   12   119
 7             109   12   121
 8             101   18   119
 9             127   20   147
 10            111    7   118
 11            103   13   116
 12             86   16   102
 13            106   11   117
 14            121    6   127
 15            114    5   119
 16            114   13   127
 17            114   11   125
 18            125   12   137
 19            121   14   135
 20            131   17   148
 21            102   11   113
 22            111   13   124
 23            131   13   144
 24            123    8   131
 25            128   14   142
 26            120   14   134
 27            115   10   125
 28            127   11   138
 29            112   12   124
 30            113   13   126
 31             92   12   104
 32            140   14   154
 33            110    7   117
 34            115   10   125
 35            130   13   143
 36            102   12   114
 37            103   13   116
 38             80    8    88
 39             75   10    85
 40             53    4    57
 41             36    7    43
 42              8    0     8
 43              3    0     3
All           4520  480  5000
------------------------------------------------------------------------------------------------------------------------
plt.figure(figsize=(15,13))
sns.boxplot(y='Income', x='Personal_Loan', data=df);
plt.figure(figsize=(15,13))
sns.swarmplot(y='ZIPCode', x='Personal_Loan', data=df);
tab1 = pd.crosstab(df.Family,df.Personal_Loan,margins=True)
print(tab1)
print('-'*120)
tab = pd.crosstab(df.Family,df.Personal_Loan,normalize='index')
tab.plot(kind='bar',stacked=True,figsize=(17,9))
plt.legend(loc="upper left", bbox_to_anchor=(1,1));
Personal_Loan     0    1   All
Family
1              1365  107  1472
2              1190  106  1296
3               877  133  1010
4              1088  134  1222
All            4520  480  5000
------------------------------------------------------------------------------------------------------------------------
Customers with families of 3 and 4 took personal loans at the highest rates (roughly 13% and 11%, respectively, versus about 7-8% for families of 1 and 2).
plt.figure(figsize=(15,13))
sns.boxplot(y='CCAvg', x='Personal_Loan', data=df);
tab1 = pd.crosstab(df.Education,df.Personal_Loan,margins=True)
print(tab1)
print('-'*120)
tab = pd.crosstab(df.Education,df.Personal_Loan,normalize='index')
tab.plot(kind='bar',stacked=True,figsize=(17,9))
plt.legend(loc="upper left", bbox_to_anchor=(1,1));
Personal_Loan     0    1   All
Education
1              2003   93  2096
2              1221  182  1403
3              1296  205  1501
All            4520  480  5000
------------------------------------------------------------------------------------------------------------------------
plt.figure(figsize=(15,13))
sns.boxplot(y='Mortgage', x='Personal_Loan', data=df);
plt.figure(figsize=(15,13))
sns.barplot(y='Securities_Account', x='Personal_Loan', data=df)
<AxesSubplot:xlabel='Personal_Loan', ylabel='Securities_Account'>
Securities account holders took personal loans at a slightly higher rate than non-holders.
tab1 = pd.crosstab(df.Securities_Account,df.Personal_Loan,margins=True)
print(tab1)
print('-'*120)
tab = pd.crosstab(df.Securities_Account,df.Personal_Loan,normalize='index')
tab.plot(kind='bar',stacked=True,figsize=(17,9))
plt.legend(loc="upper left", bbox_to_anchor=(1,1));
Personal_Loan          0    1   All
Securities_Account
0                   4058  420  4478
1                    462   60   522
All                 4520  480  5000
------------------------------------------------------------------------------------------------------------------------
About 9% of non-securities-account holders and about 11% of securities account holders took a personal loan.
plt.figure(figsize=(15,13))
sns.boxplot(y='CD_Account', x='Personal_Loan', data=df)
<AxesSubplot:xlabel='Personal_Loan', ylabel='CD_Account'>
Nearly half (about 46%) of Certificate of Deposit account holders took personal loans, compared with only about 7% of non-holders.
tab1 = pd.crosstab(df.CD_Account,df.Personal_Loan,margins=True)
print(tab1)
print('-'*120)
tab = pd.crosstab(df.CD_Account,df.Personal_Loan,normalize='index')
tab.plot(kind='bar',stacked=True,figsize=(17,9))
plt.legend(loc="upper left", bbox_to_anchor=(1,1));
Personal_Loan     0    1   All
CD_Account
0              4358  340  4698
1               162  140   302
All            4520  480  5000
------------------------------------------------------------------------------------------------------------------------
plt.figure(figsize=(25,15))
sns.heatmap(df.corr(),annot=True)
plt.show()
sns.pairplot(data=df,hue="Personal_Loan",)
plt.show()
# Creating dummies (note: 'CreditCard' was originally listed twice, which produced a duplicate CreditCard_1 column)
dummy_df = pd.get_dummies(df, columns=['Family','Education','Securities_Account','Online','CreditCard','CD_Account'], drop_first=True)
dummy_df.head()
| | Age | Experience | Income | ZIPCode | CCAvg | Mortgage | Personal_Loan | Family_2 | Family_3 | Family_4 | Education_2 | Education_3 | Securities_Account_1 | Online_1 | CreditCard_1 | CD_Account_1 | CreditCard_1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 1 | 49 | 91107 | 1.6 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 45 | 19 | 34 | 90089 | 1.5 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 39 | 15 | 11 | 94720 | 1.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 35 | 9 | 100 | 94112 | 2.7 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 35 | 8 | 45 | 91330 | 1.0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
column_names = list(dummy_df.columns)
column_names.remove('Personal_Loan') # Keep only names of features by removing the name of target variable
feature_names = column_names
print(feature_names)
['Age', 'Experience', 'Income', 'ZIPCode', 'CCAvg', 'Mortgage', 'Family_2', 'Family_3', 'Family_4', 'Education_2', 'Education_3', 'Securities_Account_1', 'Online_1', 'CreditCard_1', 'CD_Account_1', 'CreditCard_1']
X = dummy_df.drop(['Personal_Loan'],axis=1)
y = dummy_df['Personal_Loan'].astype('int64')
# Splitting data into training and test set:
X_train, X_test, y_train, y_test =train_test_split(X, y, test_size=0.3, random_state=1)
print(X_train.shape, X_test.shape)
(3500, 16) (1500, 16)
If the frequency of class A is 10% and the frequency of class B is 90%, class B becomes the dominant class and the decision tree becomes biased toward it.
In this case, we can pass a dictionary {0: 0.10, 1: 0.90} to the model to specify the weight of each class. The decision tree will then give more weight to class 1, offsetting the roughly 90/10 split between customers who did not and did take a personal loan.
class_weight is a hyperparameter of the decision tree classifier.
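As an alternative to hand-picking the dictionary, a sketch of how scikit-learn can derive such weights from the observed class frequencies (the toy 90/10 labels below are illustrative, not the actual training split):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels mirroring the roughly 90/10 class split in this dataset
y_toy = np.array([0] * 90 + [1] * 10)

# 'balanced' sets weight_c = n_samples / (n_classes * count_c),
# so the rare class gets proportionally more weight
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_toy)
print({c: round(float(w), 2) for c, w in zip([0, 1], weights)})  # {0: 0.56, 1: 5.0}
```

Passing `class_weight="balanced"` to `DecisionTreeClassifier` applies the same rule automatically.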
model = DecisionTreeClassifier(criterion='gini',class_weight={0:0.1,1:0.9},random_state=30)
model.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.1, 1: 0.9}, random_state=30)
def make_confusion_matrix(model, y_actual):
    '''
    model : classifier used to predict values of X_test (taken from the enclosing scope)
    y_actual : ground truth
    '''
    y_predict = model.predict(X_test)
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=[0, 1])
    df_cm = pd.DataFrame(cm, index=["Actual - No", "Actual - Yes"],
                         columns=['Predicted - No', 'Predicted - Yes'])
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten()/np.sum(cm)]
    labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    labels = np.asarray(labels).reshape(2, 2)
    plt.figure(figsize=(10, 7))
    sns.heatmap(df_cm, annot=labels, fmt='')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
make_confusion_matrix(model,y_test)
y_train.value_counts(1)
0    0.905429
1    0.094571
Name: Personal_Loan, dtype: float64
Only about 9.5% of the samples belong to the positive class, so even a model that marks every sample as negative would achieve roughly 90% accuracy; accuracy is therefore not a good metric for evaluation here.
True Positives: customers the model correctly predicts will take the personal loan.
True Negatives: customers the model correctly predicts will not take the loan.
False Positives: customers predicted to take the loan who actually do not (wasted marketing effort).
False Negatives: customers predicted not to take the loan who actually would (lost loan business, the costliest error here, which is why recall is the metric of interest).
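To make the point concrete, a small sketch (toy labels, not the actual split): an all-negative classifier on a ~9%-positive sample is highly accurate yet catches zero borrowers.

```python
from sklearn.metrics import accuracy_score, recall_score

y_true = [1] * 9 + [0] * 91   # ~9% positives, mirroring the class balance here
y_pred = [0] * 100            # naive model: predict "no loan" for everyone

print(accuracy_score(y_true, y_pred))  # 0.91
print(recall_score(y_true, y_pred))    # 0.0
```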
## Function to calculate recall score
def get_recall_score(model):
'''
model : classifier to predict values of X
'''
pred_train = model.predict(X_train)
pred_test = model.predict(X_test)
print("Recall on training set : ",metrics.recall_score(y_train,pred_train))
print("Recall on test set : ",metrics.recall_score(y_test,pred_test))
get_recall_score(model)
Recall on training set :  1.0
Recall on test set :  0.8791946308724832
plt.figure(figsize=(20,30))
out = tree.plot_tree(model,feature_names=feature_names,filled=True,fontsize=9,node_ids=False,class_names=None,)
#below code will add arrows to the decision tree split if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor('black')
arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(model,feature_names=feature_names,show_weights=True))
|--- Income <= 92.50 | |--- CCAvg <= 2.95 | | |--- CreditCard_1 <= 0.50 | | | |--- weights: [170.90, 0.00] class: 0 | | |--- CreditCard_1 > 0.50 | | | |--- weights: [72.60, 0.00] class: 0 | |--- CCAvg > 2.95 | | |--- CD_Account_1 <= 0.50 | | | |--- CCAvg <= 3.95 | | | | |--- Mortgage <= 102.50 | | | | | |--- Income <= 68.50 | | | | | | |--- weights: [1.50, 0.00] class: 0 | | | | | |--- Income > 68.50 | | | | | | |--- CCAvg <= 3.05 | | | | | | | |--- weights: [1.10, 0.00] class: 0 | | | | | | |--- CCAvg > 3.05 | | | | | | | |--- Family_4 <= 0.50 | | | | | | | | |--- ZIPCode <= 94714.50 | | | | | | | | | |--- ZIPCode <= 90437.50 | | | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | | | | |--- ZIPCode > 90437.50 | | | | | | | | | | |--- Age <= 63.50 | | | | | | | | | | | |--- truncated branch of depth 8 | | | | | | | | | | |--- Age > 63.50 | | | | | | | | | | | |--- weights: [0.10, 0.00] class: 0 | | | | | | | | |--- ZIPCode > 94714.50 | | | | | | | | | |--- weights: [0.60, 0.00] class: 0 | | | | | | | |--- Family_4 > 0.50 | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | |--- Mortgage > 102.50 | | | | | |--- Experience <= 4.00 | | | | | | |--- weights: [0.10, 0.00] class: 0 | | | | | |--- Experience > 4.00 | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | |--- CCAvg > 3.95 | | | | |--- Online_1 <= 0.50 | | | | | |--- weights: [1.80, 0.00] class: 0 | | | | |--- Online_1 > 0.50 | | | | | |--- weights: [2.40, 0.00] class: 0 | | |--- CD_Account_1 > 0.50 | | | |--- weights: [0.00, 4.50] class: 1 |--- Income > 92.50 | |--- Education_3 <= 0.50 | | |--- Education_2 <= 0.50 | | | |--- Family_3 <= 0.50 | | | | |--- Family_4 <= 0.50 | | | | | |--- Income <= 103.50 | | | | | | |--- CCAvg <= 3.21 | | | | | | | |--- weights: [4.00, 0.00] class: 0 | | | | | | |--- CCAvg > 3.21 | | | | | | | |--- ZIPCode <= 91485.50 | | | | | | | | |--- weights: [0.00, 2.70] class: 1 | | | | | | | |--- ZIPCode > 91485.50 | | | | | | | | |--- weights: [0.50, 0.00] class: 0 
| | | | | |--- Income > 103.50 | | | | | | |--- CreditCard_1 <= 0.50 | | | | | | | |--- weights: [30.20, 0.00] class: 0 | | | | | | |--- CreditCard_1 > 0.50 | | | | | | | |--- weights: [13.10, 0.00] class: 0 | | | | |--- Family_4 > 0.50 | | | | | |--- Income <= 93.50 | | | | | | |--- weights: [0.10, 0.00] class: 0 | | | | | |--- Income > 93.50 | | | | | | |--- Income <= 102.00 | | | | | | | |--- CCAvg <= 4.50 | | | | | | | | |--- weights: [0.00, 0.90] class: 1 | | | | | | | |--- CCAvg > 4.50 | | | | | | | | |--- weights: [0.10, 0.00] class: 0 | | | | | | |--- Income > 102.00 | | | | | | | |--- weights: [0.00, 17.10] class: 1 | | | |--- Family_3 > 0.50 | | | | |--- Income <= 108.50 | | | | | |--- weights: [1.10, 0.00] class: 0 | | | | |--- Income > 108.50 | | | | | |--- Age <= 26.00 | | | | | | |--- weights: [0.10, 0.00] class: 0 | | | | | |--- Age > 26.00 | | | | | | |--- ZIPCode <= 90019.50 | | | | | | | |--- weights: [0.10, 0.00] class: 0 | | | | | | |--- ZIPCode > 90019.50 | | | | | | | |--- Income <= 118.00 | | | | | | | | |--- CCAvg <= 2.00 | | | | | | | | | |--- weights: [0.10, 0.00] class: 0 | | | | | | | | |--- CCAvg > 2.00 | | | | | | | | | |--- weights: [0.00, 1.80] class: 1 | | | | | | | |--- Income > 118.00 | | | | | | | | |--- Online_1 <= 0.50 | | | | | | | | | |--- weights: [0.00, 11.70] class: 1 | | | | | | | | |--- Online_1 > 0.50 | | | | | | | | | |--- weights: [0.00, 18.00] class: 1 | | |--- Education_2 > 0.50 | | | |--- Income <= 110.50 | | | | |--- CCAvg <= 2.90 | | | | | |--- Income <= 106.50 | | | | | | |--- weights: [4.00, 0.00] class: 0 | | | | | |--- Income > 106.50 | | | | | | |--- Experience <= 27.00 | | | | | | | |--- CreditCard_1 <= 0.50 | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | | |--- CreditCard_1 > 0.50 | | | | | | | | |--- weights: [0.20, 0.00] class: 0 | | | | | | |--- Experience > 27.00 | | | | | | | |--- Age <= 56.00 | | | | | | | | |--- weights: [0.00, 0.90] class: 1 | | | | | | | |--- Age > 56.00 | | | | | 
| | | |--- Mortgage <= 54.50 | | | | | | | | | |--- weights: [0.10, 0.00] class: 0 | | | | | | | | |--- Mortgage > 54.50 | | | | | | | | | |--- weights: [0.10, 0.00] class: 0 | | | | |--- CCAvg > 2.90 | | | | | |--- ZIPCode <= 95083.00 | | | | | | |--- Income <= 93.50 | | | | | | | |--- weights: [0.00, 0.90] class: 1 | | | | | | |--- Income > 93.50 | | | | | | | |--- weights: [0.00, 4.50] class: 1 | | | | | |--- ZIPCode > 95083.00 | | | | | | |--- weights: [0.20, 0.00] class: 0 | | | |--- Income > 110.50 | | | | |--- Income <= 116.50 | | | | | |--- Mortgage <= 141.50 | | | | | | |--- Age <= 60.50 | | | | | | | |--- CCAvg <= 1.20 | | | | | | | | |--- weights: [0.20, 0.00] class: 0 | | | | | | | |--- CCAvg > 1.20 | | | | | | | | |--- ZIPCode <= 94887.00 | | | | | | | | | |--- CCAvg <= 2.65 | | | | | | | | | | |--- Age <= 39.50 | | | | | | | | | | | |--- weights: [0.00, 1.80] class: 1 | | | | | | | | | | |--- Age > 39.50 | | | | | | | | | | | |--- weights: [0.20, 0.00] class: 0 | | | | | | | | | |--- CCAvg > 2.65 | | | | | | | | | | |--- weights: [0.00, 4.50] class: 1 | | | | | | | | |--- ZIPCode > 94887.00 | | | | | | | | | |--- weights: [0.10, 0.00] class: 0 | | | | | | |--- Age > 60.50 | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | |--- Mortgage > 141.50 | | | | | | |--- weights: [0.40, 0.00] class: 0 | | | | |--- Income > 116.50 | | | | | |--- CreditCard_1 <= 0.50 | | | | | | |--- weights: [0.00, 67.50] class: 1 | | | | | |--- CreditCard_1 > 0.50 | | | | | | |--- weights: [0.00, 29.70] class: 1 | |--- Education_3 > 0.50 | | |--- Income <= 116.50 | | | |--- CCAvg <= 2.35 | | | | |--- Mortgage <= 236.00 | | | | | |--- weights: [4.00, 0.00] class: 0 | | | | |--- Mortgage > 236.00 | | | | | |--- CCAvg <= 1.25 | | | | | | |--- Age <= 39.50 | | | | | | | |--- weights: [0.10, 0.00] class: 0 | | | | | | |--- Age > 39.50 | | | | | | | |--- weights: [0.00, 1.80] class: 1 | | | | | |--- CCAvg > 1.25 | | | | | | |--- Mortgage <= 285.00 | | | | | | | |--- weights: 
[0.10, 0.00] class: 0 | | | | | | |--- Mortgage > 285.00 | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | |--- CCAvg > 2.35 | | | | |--- Age <= 64.00 | | | | | |--- ZIPCode <= 90389.50 | | | | | | |--- weights: [0.20, 0.00] class: 0 | | | | | |--- ZIPCode > 90389.50 | | | | | | |--- CCAvg <= 2.95 | | | | | | | |--- CCAvg <= 2.55 | | | | | | | | |--- ZIPCode <= 92601.00 | | | | | | | | | |--- weights: [0.00, 1.80] class: 1 | | | | | | | | |--- ZIPCode > 92601.00 | | | | | | | | | |--- weights: [0.10, 0.00] class: 0 | | | | | | | |--- CCAvg > 2.55 | | | | | | | | |--- CD_Account_1 <= 0.50 | | | | | | | | | |--- weights: [0.40, 0.00] class: 0 | | | | | | | | |--- CD_Account_1 > 0.50 | | | | | | | | | |--- weights: [0.10, 0.00] class: 0 | | | | | | |--- CCAvg > 2.95 | | | | | | | |--- Mortgage <= 210.50 | | | | | | | | |--- CD_Account_1 <= 0.50 | | | | | | | | | |--- weights: [0.00, 12.60] class: 1 | | | | | | | | |--- CD_Account_1 > 0.50 | | | | | | | | | |--- Age <= 46.50 | | | | | | | | | | |--- weights: [0.20, 0.00] class: 0 | | | | | | | | | |--- Age > 46.50 | | | | | | | | | | |--- weights: [0.00, 1.80] class: 1 | | | | | | | |--- Mortgage > 210.50 | | | | | | | | |--- CCAvg <= 4.20 | | | | | | | | | |--- weights: [0.00, 1.80] class: 1 | | | | | | | | |--- CCAvg > 4.20 | | | | | | | | | |--- Age <= 33.00 | | | | | | | | | | |--- weights: [0.20, 0.00] class: 0 | | | | | | | | | |--- Age > 33.00 | | | | | | | | | | |--- weights: [0.20, 0.00] class: 0 | | | | |--- Age > 64.00 | | | | | |--- Online_1 <= 0.50 | | | | | | |--- weights: [0.10, 0.00] class: 0 | | | | | |--- Online_1 > 0.50 | | | | | | |--- weights: [0.20, 0.00] class: 0 | | |--- Income > 116.50 | | | |--- weights: [0.00, 102.60] class: 1
# importance of features in the tree building ( The importance of a feature is computed as the
#(normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )
print (pd.DataFrame(model.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
                               Imp
Income                6.280873e-01
CCAvg                 9.529227e-02
Family_4              8.039578e-02
Education_2           6.628501e-02
Family_3              6.237187e-02
Education_3           2.259653e-02
Mortgage              1.278058e-02
ZIPCode               1.144117e-02
Age                   1.073498e-02
CD_Account_1          7.802904e-03
Experience            1.645121e-03
Securities_Account_1  5.664782e-04
CreditCard_1          9.046809e-15
Online_1              5.318548e-16
Family_2              0.000000e+00
CreditCard_1          0.000000e+00
importances = model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(20,20))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
from sklearn.model_selection import GridSearchCV
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1,class_weight = {0:.1,1:.9})
# Grid of parameters to choose from
parameters = {
'max_depth': np.arange(1,10),
'criterion': ['entropy','gini'],
'splitter': ['best','random'],
'min_impurity_decrease': [0.000001,0.00001,0.0001],
'max_features': ['log2','sqrt']
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.1, 1: 0.9}, criterion='entropy',
max_depth=3, max_features='log2',
min_impurity_decrease=1e-06, random_state=1)
make_confusion_matrix(estimator,y_test)
get_recall_score(estimator)
Recall on training set :  0.9697885196374623
Recall on test set :  0.9463087248322147
Recall on the test set improved from about 0.88 to about 0.95 after hyperparameter tuning, and training and test recall are now close to each other, so the tuned model generalizes much better than the default tree.
plt.figure(figsize=(15,10))
out = tree.plot_tree(estimator,feature_names=feature_names,filled=True,fontsize=9,node_ids=False,class_names=None)
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor('black')
arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(estimator,feature_names=feature_names,show_weights=True))
|--- Income <= 92.50
|   |--- CD_Account_1 <= 0.50
|   |   |--- CCAvg <= 3.05
|   |   |   |--- weights: [236.70, 0.00] class: 0
|   |   |--- CCAvg > 3.05
|   |   |   |--- weights: [9.90, 9.00] class: 0
|   |--- CD_Account_1 > 0.50
|   |   |--- Income <= 72.50
|   |   |   |--- weights: [7.00, 0.00] class: 0
|   |   |--- Income > 72.50
|   |   |   |--- weights: [1.60, 4.50] class: 1
|--- Income > 92.50
|   |--- Education_3 <= 0.50
|   |   |--- Education_2 <= 0.50
|   |   |   |--- weights: [49.40, 52.20] class: 1
|   |   |--- Education_2 > 0.50
|   |   |   |--- weights: [6.10, 109.80] class: 1
|   |--- Education_3 > 0.50
|   |   |--- Mortgage <= 406.00
|   |   |   |--- weights: [6.20, 112.50] class: 1
|   |   |--- Mortgage > 406.00
|   |   |   |--- weights: [0.00, 9.90] class: 1
# importance of features in the tree building ( The importance of a feature is computed as the
#(normalized) total reduction of the 'criterion' brought by that feature. It is also known as the Gini importance )
print (pd.DataFrame(estimator.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
# Here we see that the importance of the top features has increased and is now concentrated in just a few of them
                           Imp
Income                0.739737
Education_2           0.100624
CCAvg                 0.089037
Education_3           0.047698
CD_Account_1          0.021151
Mortgage              0.001753
Age                   0.000000
Experience            0.000000
ZIPCode               0.000000
Family_2              0.000000
Family_3              0.000000
Family_4              0.000000
Securities_Account_1  0.000000
Online_1              0.000000
CreditCard_1          0.000000
CreditCard_1          0.000000
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
Minimal cost complexity pruning recursively finds the node with the "weakest
link". The weakest link is characterized by an effective alpha, where the
nodes with the smallest effective alpha are pruned first. To get an idea of
what values of ccp_alpha could be appropriate, scikit-learn provides
DecisionTreeClassifier.cost_complexity_pruning_path that returns the
effective alphas and the corresponding total leaf impurities at each step of
the pruning process. As alpha increases, more of the tree is pruned, which
increases the total impurity of its leaves.
clf = DecisionTreeClassifier(random_state=30,class_weight = {0:0.1,1:0.9})
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
pd.DataFrame(path)
| | ccp_alphas | impurities |
|---|---|---|
| 0 | 0.000000e+00 | -5.665605e-16 |
| 1 | 1.805828e-19 | -5.663799e-16 |
| 2 | 3.611656e-19 | -5.660187e-16 |
| 3 | 3.611656e-19 | -5.656575e-16 |
| 4 | 1.444662e-18 | -5.642129e-16 |
| 5 | 1.805828e-18 | -5.624071e-16 |
| 6 | 2.437868e-18 | -5.599692e-16 |
| 7 | 2.925441e-18 | -5.570437e-16 |
| 8 | 1.072662e-17 | -5.463171e-16 |
| 9 | 1.300196e-17 | -5.333152e-16 |
| 10 | 2.275343e-17 | -5.105617e-16 |
| 11 | 2.345771e-17 | -4.871040e-16 |
| 12 | 5.265794e-17 | -4.344461e-16 |
| 13 | 2.419448e-16 | -1.925013e-16 |
| 14 | 4.441163e-15 | 4.248662e-15 |
| 15 | 1.596972e-04 | 3.193943e-04 |
| 16 | 1.617559e-04 | 6.429061e-04 |
| 17 | 1.621398e-04 | 9.671857e-04 |
| 18 | 3.079874e-04 | 1.275173e-03 |
| 19 | 3.081875e-04 | 1.583361e-03 |
| 20 | 3.081875e-04 | 1.891548e-03 |
| 21 | 3.105223e-04 | 2.823115e-03 |
| 22 | 3.199567e-04 | 3.143072e-03 |
| 23 | 3.208528e-04 | 3.784777e-03 |
| 24 | 3.212203e-04 | 4.427218e-03 |
| 25 | 4.068475e-04 | 6.461456e-03 |
| 26 | 5.323239e-04 | 6.993779e-03 |
| 27 | 5.753795e-04 | 7.569159e-03 |
| 28 | 6.202198e-04 | 8.809599e-03 |
| 29 | 6.273817e-04 | 9.436980e-03 |
| 30 | 6.784437e-04 | 1.147231e-02 |
| 31 | 7.485805e-04 | 1.222089e-02 |
| 32 | 7.828292e-04 | 1.300372e-02 |
| 33 | 7.840069e-04 | 1.535574e-02 |
| 34 | 8.273599e-04 | 1.618310e-02 |
| 35 | 9.647609e-04 | 1.714786e-02 |
| 36 | 1.176341e-03 | 1.832420e-02 |
| 37 | 1.372398e-03 | 1.969660e-02 |
| 38 | 1.435187e-03 | 2.113179e-02 |
| 39 | 1.985898e-03 | 2.510358e-02 |
| 40 | 2.127748e-03 | 2.723133e-02 |
| 41 | 2.328917e-03 | 2.956025e-02 |
| 42 | 2.909596e-03 | 3.246985e-02 |
| 43 | 3.240232e-03 | 3.571008e-02 |
| 44 | 3.393805e-03 | 3.910388e-02 |
| 45 | 3.470671e-03 | 4.604523e-02 |
| 46 | 3.841577e-03 | 4.988680e-02 |
| 47 | 4.980603e-03 | 5.984801e-02 |
| 48 | 5.881704e-03 | 6.572971e-02 |
| 49 | 5.974141e-03 | 7.170385e-02 |
| 50 | 2.132036e-02 | 9.302421e-02 |
| 51 | 2.840493e-02 | 2.066439e-01 |
| 52 | 2.928785e-01 | 4.995225e-01 |
fig, ax = plt.subplots(figsize=(10,5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
Next, we train a decision tree using each of the effective alphas. The last value
in ccp_alphas is the alpha value that prunes the whole tree,
leaving the tree, clfs[-1], with only one node.
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=30, ccp_alpha=ccp_alpha, class_weight={0: 0.1, 1: 0.9})
    clf.fit(X_train, y_train)
    clfs.append(clf)
print("Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
clfs[-1].tree_.node_count, ccp_alphas[-1]))
Number of nodes in the last tree is: 1 with ccp_alpha: 0.29287854019800325
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1,figsize=(10,7))
ax[0].plot(ccp_alphas, node_counts, marker='o', drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker='o', drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
recall_train = []
for clf in clfs:
    pred_train3 = clf.predict(X_train)
    values_train = metrics.recall_score(y_train, pred_train3)
    recall_train.append(values_train)
recall_test = []
for clf in clfs:
    pred_test3 = clf.predict(X_test)
    values_test = metrics.recall_score(y_test, pred_test3)
    recall_test.append(values_test)
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]
fig, ax = plt.subplots(figsize=(15,5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker='o', label="train",
drawstyle="steps-post",)
ax.plot(ccp_alphas, recall_test, marker='o', label="test",
drawstyle="steps-post")
ax.legend()
plt.show()
# selecting the model that gives the highest test recall
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.004980603234980805,
class_weight={0: 0.1, 1: 0.9}, random_state=30)
best_model.fit(X_train, y_train)
DecisionTreeClassifier(ccp_alpha=0.004980603234980805,
class_weight={0: 0.1, 1: 0.9}, random_state=30)
make_confusion_matrix(best_model,y_test)
get_recall_score(best_model)
Recall on training set : 0.9879154078549849
Recall on test set : 0.9798657718120806
plt.figure(figsize=(10,10))
out = tree.plot_tree(best_model,feature_names=feature_names,filled=True,fontsize=9,node_ids=False,class_names=None)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor('black')
        arrow.set_linewidth(1)
plt.show()
best_model2 = DecisionTreeClassifier(ccp_alpha=0.005,
class_weight={0: 0.1, 1: 0.9}, random_state=30)
best_model2.fit(X_train, y_train)
DecisionTreeClassifier(ccp_alpha=0.005, class_weight={0: 0.1, 1: 0.9},
random_state=30)
make_confusion_matrix(best_model2,y_test)
get_recall_score(best_model2)
Recall on training set : 0.9879154078549849
Recall on test set : 0.9798657718120806
plt.figure(figsize=(15,10))
out = tree.plot_tree(best_model2,feature_names=feature_names,filled=True,fontsize=9,node_ids=False,class_names=None)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor('black')
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(best_model2,feature_names=feature_names,show_weights=True))
|--- Income <= 92.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [243.50, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- weights: [11.70, 13.50] class: 1
|--- Income >  92.50
|   |--- Education_3 <= 0.50
|   |   |--- Education_2 <= 0.50
|   |   |   |--- Family_3 <= 0.50
|   |   |   |   |--- Family_4 <= 0.50
|   |   |   |   |   |--- weights: [47.80, 2.70] class: 0
|   |   |   |   |--- Family_4 >  0.50
|   |   |   |   |   |--- weights: [0.20, 18.00] class: 1
|   |   |   |--- Family_3 >  0.50
|   |   |   |   |--- weights: [1.40, 31.50] class: 1
|   |   |--- Education_2 >  0.50
|   |   |   |--- Income <= 110.50
|   |   |   |   |--- CCAvg <= 2.90
|   |   |   |   |   |--- weights: [4.70, 0.90] class: 0
|   |   |   |   |--- CCAvg >  2.90
|   |   |   |   |   |--- weights: [0.20, 5.40] class: 1
|   |   |   |--- Income >  110.50
|   |   |   |   |--- weights: [1.20, 103.50] class: 1
|   |--- Education_3 >  0.50
|   |   |--- weights: [6.20, 122.40] class: 1
# Importance of features in the tree building. (The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance.)
print (pd.DataFrame(best_model2.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
                          Imp
Income               0.679714
Family_4             0.086636
Education_2          0.075308
Family_3             0.070862
CCAvg                0.061869
Education_3          0.025612
Age                  0.000000
Experience           0.000000
ZIPCode              0.000000
Mortgage             0.000000
Family_2             0.000000
Securities_Account_1 0.000000
Online_1             0.000000
CreditCard_1         0.000000
CD_Account_1         0.000000
CreditCard_1         0.000000
importances = best_model2.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
comparison_frame3 = pd.DataFrame({'Model':['Initial decision tree model','Decision tree with hyperparameter tuning',
'Decision tree with post-pruning'], 'Train_Recall':[1,0.97,0.99], 'Test_Recall':[0.88,0.95,0.98], 'True_Positive':[8.73,9.40,9.73], 'False_Positive':[88.93,70.07,84.07]})
comparison_frame3
| Model | Train_Recall | Test_Recall | True_Positive | False_Positive | |
|---|---|---|---|---|---|
| 0 | Initial decision tree model | 1.00 | 0.88 | 8.73 | 88.93 |
| 1 | Decision tree with hyperparameter tuning | 0.97 | 0.95 | 9.40 | 70.07 |
| 2 | Decision tree with post-pruning | 0.99 | 0.98 | 9.73 | 84.07 |
The decision tree model with post-pruning has given the best recall score on the test data.
cols4 = comparison_frame3[['Train_Recall','Test_Recall','True_Positive','False_Positive']].columns.tolist()
plt.figure(figsize=(50,50))
for i, variable in enumerate(cols4):
    plt.subplot(3,2,i+1)
    sns.barplot(comparison_frame3['Model'],comparison_frame3[variable],palette="PuBu")
    plt.tight_layout()
    plt.title(variable)
plt.show()
False_prediction_frame = pd.DataFrame({'Model':['Initial decision tree model','Decision tree with hyperparameter tuning',
'Decision tree with post-pruning'], 'False_Negative':[1.2,0.53,0.20], 'True_Negative':[1.13,20.0,6.0]})
False_prediction_frame
| Model | False_Negative | True_Negative | |
|---|---|---|---|
| 0 | Initial decision tree model | 1.20 | 1.13 |
| 1 | Decision tree with hyperparameter tuning | 0.53 | 20.00 |
| 2 | Decision tree with post-pruning | 0.20 | 6.00 |
cols = False_prediction_frame[['False_Negative','True_Negative']].columns.tolist()
plt.figure(figsize=(48,48))
for i, variable in enumerate(cols):
    plt.subplot(3,2,i+1)
    sns.barplot(False_prediction_frame['Model'],False_prediction_frame[variable],palette="PuBu")
    plt.tight_layout()
    plt.title(variable)
plt.show()
# outlier detection using boxplots
numerical_col = df.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(20,30))
for i, variable in enumerate(numerical_col):
    plt.subplot(5,4,i+1)
    plt.boxplot(df[variable],whis=1.5)
    plt.tight_layout()
    plt.title(variable)
plt.show()
def treat_outliers(df, col):
    '''
    Treats outliers in a numerical variable using the IQR rule.
    df: data frame
    col: name of the numerical column
    '''
    Q1 = df[col].quantile(0.25)  # 25th percentile
    Q3 = df[col].quantile(0.75)  # 75th percentile
    IQR = Q3 - Q1
    Lower_Whisker = Q1 - 1.5*IQR
    Upper_Whisker = Q3 + 1.5*IQR
    # values smaller than Lower_Whisker are clipped to Lower_Whisker,
    # and values above Upper_Whisker are clipped to Upper_Whisker
    df[col] = np.clip(df[col], Lower_Whisker, Upper_Whisker)
    return df
def treat_outliers_all(df, col_list):
    '''
    Treats outliers in all numerical variables.
    df: data frame
    col_list: list of numerical columns
    '''
    for c in col_list:
        df = treat_outliers(df, c)
    return df
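The helper above is defined but not invoked in this section. As a quick illustration of what the clipping does, here is the same IQR logic applied to a small hypothetical series (the column name and values are made up for this sketch):

```python
import numpy as np
import pandas as pd

# hypothetical income values; 300 is an obvious outlier
toy = pd.DataFrame({'Income': [40, 42, 45, 48, 50, 52, 55, 300]})

Q1, Q3 = toy['Income'].quantile(0.25), toy['Income'].quantile(0.75)
IQR = Q3 - Q1
# clip everything to [Q1 - 1.5*IQR, Q3 + 1.5*IQR], exactly as treat_outliers does
toy['Income'] = np.clip(toy['Income'], Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)
print(toy['Income'].max())  # the 300 is pulled down to the upper whisker
```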
# Creating dummies
# Note: 'CreditCard' appears twice in the column list below, so a duplicate CreditCard_1
# column is created -- this is what produces the infinite VIF values for CreditCard_1 later on
dummy_dc = pd.get_dummies(df, columns=['Family','Education','Securities_Account','Online','CreditCard','CD_Account','CreditCard'],drop_first=True)
dummy_dc.head()
| Age | Experience | Income | ZIPCode | CCAvg | Mortgage | Personal_Loan | Family_2 | Family_3 | Family_4 | Education_2 | Education_3 | Securities_Account_1 | Online_1 | CreditCard_1 | CD_Account_1 | CreditCard_1 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 1 | 49 | 91107 | 1.6 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 45 | 19 | 34 | 90089 | 1.5 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 39 | 15 | 11 | 94720 | 1.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 35 | 9 | 100 | 94112 | 2.7 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 35 | 8 | 45 | 91330 | 1.0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
column_names = list(dummy_dc.columns)
column_names.remove('Personal_Loan') # Keep only names of features by removing the name of target variable
feature_names = column_names
print(feature_names)
['Age', 'Experience', 'Income', 'ZIPCode', 'CCAvg', 'Mortgage', 'Family_2', 'Family_3', 'Family_4', 'Education_2', 'Education_3', 'Securities_Account_1', 'Online_1', 'CreditCard_1', 'CD_Account_1', 'CreditCard_1']
x = dummy_dc.drop(['Personal_Loan'],axis=1)
Y = dummy_dc['Personal_Loan'].astype('int64')
x_train, x_test, Y_train, Y_test = train_test_split(x,
Y, test_size=0.30)
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(solver='newton-cg',max_iter=1000,penalty='none',verbose=True,n_jobs=-1,random_state=0)
# There are several optimizers; here we use the 'newton-cg' solver with max_iter equal to 1000
# max_iter caps the number of iterations the solver is allowed to take to converge
logreg.fit(x_train, Y_train)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=-1)]: Done 1 out of 1 | elapsed: 2.2s finished
LogisticRegression(max_iter=1000, n_jobs=-1, penalty='none', random_state=0,
solver='newton-cg', verbose=True)
#Predict for train set
pred_train = logreg.predict(x_train)
from sklearn.metrics import classification_report,confusion_matrix
def make_confusion_matrix(Y_actual, Y_predict, labels=[1, 0]):
    '''
    Y_actual : ground truth
    Y_predict: predicted class
    '''
    cm = confusion_matrix(Y_actual, Y_predict, labels=labels)
    df_cm = pd.DataFrame(cm, index=["1", "0"], columns=["1", "0"])
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten()/np.sum(cm)]
    labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    labels = np.asarray(labels).reshape(2, 2)
    plt.figure(figsize=(7, 5))
    sns.heatmap(df_cm, annot=labels, fmt='')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
# let us make confusion matrix on train set
make_confusion_matrix(Y_train,pred_train)
#Predict for test set
pred_test = logreg.predict(x_test)
print("confusion matrix = \n")
make_confusion_matrix(Y_test,pred_test)
confusion matrix =
#Recall with a threshold of 0.5
from sklearn.metrics import recall_score
print('Recall on train data:',recall_score(Y_train, pred_train) )
print('Recall on test data:',recall_score(Y_test, pred_test))
Recall on train data: 0.684971098265896
Recall on test data: 0.7164179104477612
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
logit_roc_auc = roc_auc_score(Y_test, logreg.predict_proba(x_test)[:,1])
fpr, tpr, thresholds = roc_curve(Y_test, logreg.predict_proba(x_test)[:,1])
plt.figure(figsize=(16,8))
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0.0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()
# The optimal cut-off would be where tpr is high and fpr is low
fpr, tpr, thresholds = roc_curve(Y_test, logreg.predict_proba(x_test)[:,1])
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold = thresholds[optimal_idx]
print(optimal_threshold)
0.0011769104183964726
target_names = ['Personal_Loan']
Y_pred_tr = (logreg.predict_proba(x_train)[:,1]>optimal_threshold).astype(int)
Y_pred_ts = (logreg.predict_proba(x_test)[:,1]>optimal_threshold).astype(int)
make_confusion_matrix(Y_test,Y_pred_ts)
#Recall with a threshold of 0.5
from sklearn.metrics import recall_score
print('Recall on train data:',recall_score(Y_train, pred_train) )
print('Recall on test data:',recall_score(Y_test, pred_test))
Recall on train data: 0.684971098265896
Recall on test data: 0.7164179104477612
After using the optimal threshold, we see that true positives have increased from 6% to 9%, and true negatives have decreased from 90% to 29% (i.e., many more customers are flagged as likely loan takers).
There are different ways of detecting (or testing for) multicollinearity; one such way is the Variance Inflation Factor.
Variance Inflation Factor: Variance inflation factors measure the inflation in the variances of the regression coefficient estimates due to collinearities that exist among the predictors. It is a measure of how much the variance of the estimated regression coefficient βk is "inflated" by the existence of correlation among the predictor variables in the model.
General rule of thumb: If VIF is 1, there is no correlation between the kth predictor and the remaining predictor variables, and hence the variance of β̂k is not inflated at all. If VIF exceeds 5, we say there is moderate multicollinearity, and if it is 10 or higher, it shows signs of high multicollinearity. But the purpose of the analysis should dictate which threshold to use.
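To make the definition concrete: the VIF of a predictor equals 1 / (1 - R²), where R² comes from regressing that predictor on all the remaining ones. A minimal sketch with hypothetical toy data (the variable names and values are illustrative only, not drawn from the bank dataset):

```python
import numpy as np

def vif(target, others):
    """VIF of `target` given the other predictors: 1 / (1 - R^2),
    with R^2 taken from regressing target on the others (plus an intercept)."""
    X = np.column_stack([np.ones(len(target)), others])
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    r2 = 1 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
    return 1.0 / (1.0 - r2)

# hypothetical toy predictors: x2 nearly duplicates x1, x3 is unrelated
x1 = np.arange(20.0)
x2 = x1 + np.tile([0.1, -0.1], 10)
x3 = np.tile([1.0, 2.0, 3.0, 4.0], 5)

print(vif(x1, np.column_stack([x2, x3])))  # large: x1 is almost a copy of x2
print(vif(x3, np.column_stack([x1, x2])))  # near 1: little inflation
```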
# dataframe with the predictor columns (dummies included)
num_feature_set = x.copy()
from statsmodels.tools.tools import add_constant
num_feature_set = add_constant(num_feature_set)
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif_series1 = pd.Series([variance_inflation_factor(num_feature_set.values,i) for i in range(num_feature_set.shape[1])],index=num_feature_set.columns)
print('Series before feature selection: \n\n{}\n'.format(vif_series1))
Series before feature selection: 

const                   3289.498758
Age                       94.136285
Experience                93.975583
Income                     1.877736
ZIPCode                    1.005639
CCAvg                      1.730881
Mortgage                   1.051875
Family_2                   1.400632
Family_3                   1.386139
Family_4                   1.427808
Education_2                1.294940
Education_3                1.336068
Securities_Account_1       1.138172
Online_1                   1.040865
CreditCard_1                    inf
CD_Account_1               1.337056
CreditCard_1                    inf
dtype: float64
# dropping variables with perfect collinearity
variables_with_perfect_collinearity = vif_series1[vif_series1.values==np.inf].index.tolist()
num_feature_set = num_feature_set.drop(variables_with_perfect_collinearity,axis=1)
num_feature_set2 = x_train.drop('Age', axis=1)
vif_series2 = pd.Series([variance_inflation_factor(num_feature_set2.values,i) for i in range(num_feature_set2.shape[1])],index=num_feature_set2.columns)
print('Series before feature selection: \n\n{}\n'.format(vif_series2))
Series before feature selection: 

Experience               4.052941
Income                   6.860266
ZIPCode                 14.751671
CCAvg                    3.964978
Mortgage                 1.369517
Family_2                 1.867983
Family_3                 1.715648
Family_4                 1.854191
Education_2              1.786308
Education_3              1.759731
Securities_Account_1     1.266059
Online_1                 2.583818
CreditCard_1                  inf
CD_Account_1             1.418195
CreditCard_1                  inf
dtype: float64
Income and ZIPCode have high VIF. ZIPCode will be dropped because it has the highest VIF.
x_train, x_test, Y_train, Y_test = train_test_split(num_feature_set, Y, test_size=0.30)
import statsmodels.api as sm
logit = sm.Logit(Y_train, x_train)
lg = logit.fit()
Optimization terminated successfully.
Current function value: 0.112720
Iterations 9
print(lg.summary())
Logit Regression Results
==============================================================================
Dep. Variable: Personal_Loan No. Observations: 3500
Model: Logit Df Residuals: 3485
Method: MLE Df Model: 14
Date: Fri, 11 Jun 2021 Pseudo R-squ.: 0.6391
Time: 16:54:30 Log-Likelihood: -394.52
converged: True LL-Null: -1093.2
Covariance Type: nonrobust LLR p-value: 6.015e-290
========================================================================================
coef std err z P>|z| [0.025 0.975]
----------------------------------------------------------------------------------------
const -10.4159 5.418 -1.922 0.055 -21.036 0.204
Age -0.0282 0.081 -0.346 0.729 -0.188 0.131
Experience 0.0371 0.081 0.458 0.647 -0.122 0.196
Income 0.0623 0.004 16.664 0.000 0.055 0.070
ZIPCode -2.361e-05 5.4e-05 -0.437 0.662 -0.000 8.23e-05
CCAvg 0.2729 0.057 4.786 0.000 0.161 0.385
Mortgage 0.0002 0.001 0.230 0.818 -0.001 0.002
Family_2 -0.3755 0.288 -1.306 0.192 -0.939 0.188
Family_3 2.1252 0.298 7.132 0.000 1.541 2.709
Family_4 1.5178 0.286 5.303 0.000 0.957 2.079
Education_2 3.9016 0.335 11.637 0.000 3.244 4.559
Education_3 4.0748 0.333 12.223 0.000 3.421 4.728
Securities_Account_1 -0.8301 0.388 -2.139 0.032 -1.591 -0.070
Online_1 -0.5003 0.199 -2.515 0.012 -0.890 -0.110
CD_Account_1 2.9888 0.357 8.371 0.000 2.289 3.689
========================================================================================
Possibly complete quasi-separation: A fraction 0.12 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
num_feature_set3 = x_train.drop('Experience', axis=1)
vif_series3 = pd.Series([variance_inflation_factor(num_feature_set3.values,i) for i in range(num_feature_set3.shape[1])],index=num_feature_set3.columns)
print('Series after dropping Experience: \n\n{}\n'.format(vif_series3))
Series after dropping Experience: 

const                   2883.960985
Age                        1.015706
Income                     1.876654
ZIPCode                    1.008498
CCAvg                      1.733109
Mortgage                   1.045051
Family_2                   1.412355
Family_3                   1.398398
Family_4                   1.442551
Education_2                1.288351
Education_3                1.258757
Securities_Account_1       1.110752
Online_1                   1.037865
CD_Account_1               1.196153
dtype: float64
All the predictor variables (excluding the constant) now have a VIF below 5.
logit2 = sm.Logit(Y_train, x_train)
lg2 = logit2.fit()
print(lg2.summary())
Optimization terminated successfully.
Current function value: 0.112720
Iterations 9
Logit Regression Results
==============================================================================
Dep. Variable: Personal_Loan No. Observations: 3500
Model: Logit Df Residuals: 3485
Method: MLE Df Model: 14
Date: Fri, 11 Jun 2021 Pseudo R-squ.: 0.6391
Time: 16:54:30 Log-Likelihood: -394.52
converged: True LL-Null: -1093.2
Covariance Type: nonrobust LLR p-value: 6.015e-290
========================================================================================
coef std err z P>|z| [0.025 0.975]
----------------------------------------------------------------------------------------
const -10.4159 5.418 -1.922 0.055 -21.036 0.204
Age -0.0282 0.081 -0.346 0.729 -0.188 0.131
Experience 0.0371 0.081 0.458 0.647 -0.122 0.196
Income 0.0623 0.004 16.664 0.000 0.055 0.070
ZIPCode -2.361e-05 5.4e-05 -0.437 0.662 -0.000 8.23e-05
CCAvg 0.2729 0.057 4.786 0.000 0.161 0.385
Mortgage 0.0002 0.001 0.230 0.818 -0.001 0.002
Family_2 -0.3755 0.288 -1.306 0.192 -0.939 0.188
Family_3 2.1252 0.298 7.132 0.000 1.541 2.709
Family_4 1.5178 0.286 5.303 0.000 0.957 2.079
Education_2 3.9016 0.335 11.637 0.000 3.244 4.559
Education_3 4.0748 0.333 12.223 0.000 3.421 4.728
Securities_Account_1 -0.8301 0.388 -2.139 0.032 -1.591 -0.070
Online_1 -0.5003 0.199 -2.515 0.012 -0.890 -0.110
CD_Account_1 2.9888 0.357 8.371 0.000 2.289 3.689
========================================================================================
Possibly complete quasi-separation: A fraction 0.12 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
Calculate the probability from the odds ratio using the formula probability = odds / (1+odds)
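As a quick sanity check of that formula before applying it to the model coefficients, a minimal sketch (the function name is illustrative):

```python
# probability = odds / (1 + odds)
def odds_to_probability(odds):
    return odds / (1.0 + odds)

print(odds_to_probability(1.0))  # even odds -> 0.5
print(odds_to_probability(3.0))  # 3-to-1 odds -> 0.75
```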
#Calculate Odds Ratio, probability
##create a data frame to collate Odds ratio, probability and p-value of the coef
lgcoef = pd.DataFrame(lg.params, columns=['coef'])
lgcoef.loc[:, "Odds_ratio"] = np.exp(lgcoef.coef)
lgcoef['probability'] = lgcoef['Odds_ratio']/(1+lgcoef['Odds_ratio'])
lgcoef['pval']=lg.pvalues
pd.options.display.float_format = '{:.2f}'.format
# Filter by significant p-value (pval <0.005) and sort descending by Odds ratio
lgcoef = lgcoef.sort_values(by="Odds_ratio", ascending=False)
pval_filter = lgcoef['pval']<=0.005
lgcoef[pval_filter]
| coef | Odds_ratio | probability | pval | |
|---|---|---|---|---|
| Education_3 | 4.07 | 58.84 | 0.98 | 0.00 |
| Education_2 | 3.90 | 49.48 | 0.98 | 0.00 |
| CD_Account_1 | 2.99 | 19.86 | 0.95 | 0.00 |
| Family_3 | 2.13 | 8.37 | 0.89 | 0.00 |
| Family_4 | 1.52 | 4.56 | 0.82 | 0.00 |
| CCAvg | 0.27 | 1.31 | 0.57 | 0.00 |
| Income | 0.06 | 1.06 | 0.52 | 0.00 |
# we are looking at the overall significant variables
pval_filter = lgcoef['pval']<=0.0001
imp_vars = lgcoef[pval_filter].index.tolist()
# we are going to recover the overall variables (un-one-hot-encoded variables) from the categorical variables
sig_var = []
for col in imp_vars:
    if '_' in col:
        first_part = col.split('_')[0]
        for c in df.columns:
            if first_part in c and c not in sig_var:
                sig_var.append(c)
start = '\033[1m'  # bold
end = '\033[0m'    # reset formatting
print('Most significant variables category-wise are :\n',lgcoef[pval_filter].index.tolist())
print('*'*120)
print(start+'Most overall significant variables are '+end,':\n',sig_var)
Most significant variables category-wise are :
 ['Education_3', 'Education_2', 'CD_Account_1', 'Family_3', 'Family_4', 'CCAvg', 'Income']
************************************************************************************************************************
Most overall significant variables are  :
 ['Education', 'CD_Account', 'Family']
Education_3 and Education_2 are the most significant variables for prediction.
pred_train = lg.predict(x_train)
pred_train = np.round(pred_train)
print("confusion matrix = \n")
make_confusion_matrix(Y_train,pred_train )
confusion matrix =
pred_ts = lg.predict(x_test)
pred_ts = np.round(pred_ts)
print("confusion matrix = \n")
make_confusion_matrix(Y_test,pred_ts )
confusion matrix =
#Recall with the optimal threshold
from sklearn.metrics import recall_score
print('Recall on train data:',recall_score(Y_train, Y_pred_tr) )
print('Recall on test data:',recall_score(Y_test, Y_pred_ts))
Recall on train data: 0.6757575757575758
Recall on test data: 0.64
fpr, tpr, thresholds = roc_curve(Y_test, lg.predict(x_test))
plt.figure(figsize=(13,8))
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()
pred_train = lg.predict(x_train)
# The optimal cut off would be where tpr is high and fpr is low
fpr, tpr, thresholds = roc_curve(Y_train, pred_train)
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold = thresholds[optimal_idx]
print(optimal_threshold)
0.11557125127797124
target_names = ['Personal_Loan']
Y_pred_tr = (lg.predict(x_train)>optimal_threshold).astype(int)
Y_pred_ts = (lg.predict(x_test)>optimal_threshold).astype(int)
make_confusion_matrix(Y_train,Y_pred_tr )
make_confusion_matrix(Y_test,Y_pred_ts)
print('Recall on train data:',recall_score(Y_train,Y_pred_tr) )
print('Recall on test data:',recall_score(Y_test,Y_pred_ts))
Recall on train data: 0.8939393939393939
Recall on test data: 0.8466666666666667
import pandas as pd
comparison_frame = pd.DataFrame({'Model':['Initial Logistic Regression Model with sklearn',
                                          'Optimal threshold - Logistic Regression Model with sklearn',
                                          'Logistic Regression with feature elimination one by one',
                                          'Optimal threshold - Logistic Regression with feature elimination one by one'],
                                 'Train_Recall':[0.64,0.69,0.73,0.89],
                                 'Test_Recall':[0.63,0.68,0.75,0.85],
                                 'True_Positives':[6.77,8.9,6.54,8.48],
                                 'False_Positives':[89.71,31.33,89.60,83.74],
                                 'AUC ROC':[96,96,96,96]})
comparison_frame
| Model | Train_Recall | Test_Recall | True_Positives | False_Positives | AUC ROC | |
|---|---|---|---|---|---|---|
| 0 | Initial Logistic Regression Model with sklearn | 0.64 | 0.63 | 6.77 | 89.71 | 96 |
| 1 | Optimal threshold - Logistic Regression Model ... | 0.69 | 0.68 | 8.90 | 31.33 | 96 |
| 2 | Logistic Regression with feature elimination o... | 0.73 | 0.75 | 6.54 | 89.60 | 96 |
| 3 | Optimal threshold - Logistic Regression with f... | 0.89 | 0.85 | 8.48 | 83.74 | 96 |
cols3 = comparison_frame[['Train_Recall','Test_Recall','True_Positives','False_Positives']].columns.tolist()
plt.figure(figsize=(100,90))
for i, variable in enumerate(cols3):
    plt.subplot(3,2,i+1)
    sns.barplot(comparison_frame['Model'],comparison_frame[variable],palette="PuBu")
    plt.tight_layout()
    plt.title(variable)
plt.show()
import pandas as pd
comparison_frame2 = pd.DataFrame({'Model':['Initial Logistic Regression Model',
                                           'Optimal threshold - Logistic Regression',
                                           'Logistic Regression with Multicollinearity',
                                           'Optimal threshold - Logistic Regression with Multicollinearity'],
                                  'False_Negative':[0.94,0.97,0.97,6.83],
                                  'True_Negative':[3.11,0.00,2.89,1.00]})
comparison_frame2
| Model | False_Negative | True_Negative | |
|---|---|---|---|
| 0 | Initial Logistic Regression Model | 0.94 | 3.11 |
| 1 | Optimal threshold - Logistic Regression | 0.97 | 0.00 |
| 2 | Logistic Regression with Multicollinearity | 0.97 | 2.89 |
| 3 | Optimal threshold - Logistic Regression with M... | 6.83 | 1.00 |
cols2 = comparison_frame2[['False_Negative','True_Negative']].columns.tolist()
plt.figure(figsize=(60,48))
for i, variable in enumerate(cols2):
plt.subplot(3,2,i+1)
sns.barplot(comparison_frame2['Model'],comparison_frame2[variable],palette="PuBu")
plt.tight_layout()
plt.title(variable)
plt.show()
From my observations, and choosing the decision tree model as the best model for prediction, here are my recommendations.
According to the decision tree model:
a) If a customer has an income greater than $92,500, there is a very high chance of the customer taking a personal loan.
b) If a customer has an income below $92,500 and makes enquiries about a personal loan, then offering a good interest rate gives a very high chance that the customer will take a personal loan.
It is advised that customers with a family of four should be a target market, as they may require a personal loan with a big family and more individuals to cater for. A joint family loan package, attractive interest rates, and flexible payment plans are recommended to encourage families to take personal loans and build the bank's assets.
Graduates and holders of advanced/professional degrees have access to better-paying and more structured jobs, which directly affects customer income. The bank's marketing staff should develop a personal loan package for company employees. Partnering with companies will draw organizations to do business with the bank and make it possible to guarantee employees who want to take personal loans.
It is observed that undergraduates and low-income earners do not take many personal loans. Most organizations have undergraduates on their payroll. A partnership will draw in undergraduates and customers earning less than $92,500 to apply for loans if their organization can guarantee the staff.
From the EDA, customers with high average monthly credit card spending and a good credit rating should be a target for personal loans. This will also discourage customers from using other banks' credit cards.
Customer retention and education: Customers should be encouraged to take personal loans by offering a lower interest rate on the next personal loan, and by educating customers on the importance of improving their credit rating.
Better resource management: Customers should be encouraged to apply for personal loans on the online platform, and the personal loan staff should respond to customers within at most 6 hours.